Synthetic Dataset: Prefix Caching Controls #183
Pull Request Overview
This PR implements prefix caching controls for the synthetic dataset generator by adding configurable prefix buckets and ensuring unique prompts. The implementation allows control of token prefix cache rates through shared prefixes across samples while maintaining prompt uniqueness.
Key changes include:
- Added `PrefixBucketConfig` for configurable prefix generation with bucket weights, prefix counts, and token lengths
- Modified prompt generation to include auto-incrementing unique prefixes and configurable shared prefixes
- Updated documentation to reflect the new prefix configuration options
Reviewed Changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/guidellm/dataset/synthetic.py` | Core implementation of prefix bucket configuration and modified prompt generation logic |
| `src/guidellm/dataset/__init__.py` | Exported the new `PrefixBucketConfig` class |
| `tests/unit/dataset/test_synthetic.py` | Comprehensive test suite for the new prefix functionality |
| `docs/datasets.md` | Updated documentation with the `prefix_tokens` parameter |
Comments suppressed due to low confidence (1)
docs/datasets.md:79
The documentation describes a `prefix_tokens` parameter, but the implementation uses `prefix_buckets` with a more complex structure. This documentation appears to be outdated or incorrect for the current implementation.
- `prefix_tokens`: Number of tokens to share as a prefix across all prompts. Is additive to the prompt tokens distribution so each request is `prefix_tokens + prompt_tokens_sample()`. If unset, defaults to 0.
Summary
Work to allow control of token prefix cache rates with the synthetic data generator. First, it adds an auto-incrementing single-token prefix so the same prefix is never repeated across prompts. Second, it adds controls for sharing a fixed prefix between samples.
Details
1. Ensure every prompt is unique
When generating a prompt, the first token is now taken from an iterator over the tokenizer vocab.
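A minimal sketch of the idea (illustrative only, not the PR's actual code; the tokenizer choice and helper name here are hypothetical):

```python
# Illustrative sketch: cycle over the tokenizer vocabulary and prepend one
# distinct token to each generated prompt, so prompts never start with the
# same token until the vocabulary wraps around.
from itertools import cycle

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # hypothetical tokenizer choice
unique_first_tokens = cycle(sorted(tokenizer.get_vocab().values()))


def prepend_unique_token(prompt_token_ids: list[int]) -> list[int]:
    # Each call consumes the next vocabulary token id as the prompt's first token.
    return [next(unique_first_tokens)] + prompt_token_ids
```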
2. Add a configurable prefix to simulate a system prompt or other common token prefixes
Example usage:
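The example block from the original description is not preserved here; the following is a hedged reconstruction based on the names mentioned in this PR and its review (`PrefixBucketConfig`, `prefix_buckets`, bucket weights, prefix counts, and token lengths), so exact class and field names may differ from the merged code:

```python
# Hedged sketch: configure a shared 128-token prefix (e.g. a system prompt)
# for the synthetic dataset. Field names such as bucket_weight, prefix_count,
# and prefix_tokens follow the PR description and may not match the final API.
from guidellm.dataset import PrefixBucketConfig, SyntheticDatasetConfig

config = SyntheticDatasetConfig(
    prompt_tokens=256,   # target prompt length in tokens
    output_tokens=128,   # target output length in tokens
    samples=1000,        # number of synthetic requests to generate
    prefix_buckets=[
        # A single bucket: every sample shares one 128-token prefix.
        PrefixBucketConfig(bucket_weight=100, prefix_count=1, prefix_tokens=128),
    ],
)
```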
Test Plan
pytest tests/unit/dataset
Related Issues
Use of AI
## WRITTEN BY AI ##